Phase 3: EDA

Summary



**Importing the dataset and important libraries**

Understanding the dataset in depth

We will start with a basic exploration of the dataset before looking for trends. This will give us a better understanding of the data and might even surface something entirely new.


---> We will start by finding the total number of users. We have already performed this operation, but let's go over it again.

This dataset contains exactly 206,209 unique users.
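The user count can be obtained with a simple `nunique` over the orders table. A minimal sketch, using a tiny stand-in DataFrame with the assumed Instacart schema (one row per order, with a `user_id` column):

```python
import pandas as pd

# Toy stand-in for orders.csv (one row per order; column names assumed
# from the Instacart schema -- the real table has 206,209 distinct users).
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "user_id":  [10, 10, 11, 12, 12],
})

# Number of distinct users who placed at least one order.
n_users = orders["user_id"].nunique()
print(n_users)  # 3 in this toy example
```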


---> Moving on, let's find the range of the total number of orders placed by each user.

The distribution of orders per user is heavily skewed toward lower order counts, with an average of approximately 16 orders. Now we know that this dataset is not dominated by a small number of users [which was our earlier speculation].
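The per-user order counts come from a groupby over the orders table. A sketch on toy data (the real distribution has a mean of about 16 orders per user):

```python
import pandas as pd

# Toy orders table (schema assumed from orders.csv: one row per order).
orders = pd.DataFrame({
    "order_id": range(1, 8),
    "user_id":  [1, 1, 1, 2, 2, 3, 3],
})

# Orders placed by each user, then summary statistics of that distribution.
orders_per_user = orders.groupby("user_id")["order_id"].count()
print(orders_per_user.describe())
```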


---> Now we will find:

Aisle count for each Department

Product count for each Aisle

Now that we have merged all the tables, let's check for null values.

No null values, nice! Let's move on to the main task.
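The merge-and-check step can be sketched as two left joins followed by `isnull().sum()`. Toy stand-ins for `products.csv`, `aisles.csv` and `departments.csv` are used here; the column names are the assumed Instacart schema:

```python
import pandas as pd

# Toy versions of products.csv, aisles.csv and departments.csv.
products = pd.DataFrame({"product_id": [1, 2], "product_name": ["Milk", "Cola"],
                         "aisle_id": [1, 2], "department_id": [1, 2]})
aisles = pd.DataFrame({"aisle_id": [1, 2], "aisle": ["dairy", "soft drinks"]})
departments = pd.DataFrame({"department_id": [1, 2],
                            "department": ["dairy eggs", "beverages"]})

# Merge the lookup tables onto products, then count nulls per column.
merged = (products.merge(aisles, on="aisle_id", how="left")
                  .merge(departments, on="department_id", how="left"))
print(merged.isnull().sum())
```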


Aisle count for each department

Hmm... there is a department named "missing"; let's explore this issue in depth.

As we can see, both the department and aisle names are missing for these products. Let's check the total number of departments and aisles with "missing" as their name.

Speculation: there is only going to be one aisle and one department named "missing". We can say this with some confidence because during the Data Cleaning phase we checked the aisles.csv and departments.csv datasets for duplicate values and found none. Hence aisle_id 100 and department_id 21 are likely the only aisle and department named "missing".

Even though their aisle and department names are missing, these products may still appear in the transactional dataset. We already know that a few products are absent from the main transactional dataset. We will check later how many products belong to aisle 100.
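The speculation above can be checked with a simple filter over the two lookup tables. A sketch on toy data (the real "missing" entries are aisle_id 100 and department_id 21, per the text):

```python
import pandas as pd

# Toy aisles/departments lookup tables containing the "missing" entries.
aisles = pd.DataFrame({"aisle_id": [99, 100], "aisle": ["dairy", "missing"]})
departments = pd.DataFrame({"department_id": [20, 21],
                            "department": ["deli", "missing"]})

# How many aisles/departments are literally named "missing"?
missing_aisles = aisles[aisles["aisle"] == "missing"]
missing_depts = departments[departments["department"] == "missing"]
print(len(missing_aisles), len(missing_depts))  # expect 1 and 1
```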


Product count for each aisle

The "missing" aisle, i.e. aisle 100, has the highest number of products, followed by "candy chocolate". From the top 15 aisles we can see that this online grocery store serves a wide variety of aisles, each offering a large variety of products.

Top 15 aisles with the fewest products. Apart from the first aisle, "bulk dried fruits vegetables", the rest have a fair number of products associated with them.
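Both rankings come from the same per-aisle product count, sorted in opposite directions. A sketch on a toy merged table (column names assumed):

```python
import pandas as pd

# Toy merged product table (product_id plus its aisle name).
merged = pd.DataFrame({
    "product_id": [1, 2, 3, 4, 5],
    "aisle": ["candy chocolate", "candy chocolate", "missing",
              "missing", "yogurt"],
})

# Number of distinct products offered in each aisle, largest first;
# head(15) gives the top aisles, tail(15) the smallest ones.
products_per_aisle = (merged.groupby("aisle")["product_id"]
                            .nunique()
                            .sort_values(ascending=False))
print(products_per_aisle.head(15))
print(products_per_aisle.tail(15))
```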

---> Finding products missing from the main transactional dataset [i.e. order_products__prior.csv]

Only 3 of the products from the department and aisle named "missing" are absent. We can also see that a total of 11 products are not present in the main transactional dataset. This dataset alone is not sufficient to explain why these products are absent; that will require further investigation with the stakeholders.
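Finding catalogue products that never appear in a transaction is an anti-join, which in pandas can be done with `isin`. A sketch on toy stand-ins for the product catalogue and `order_products__prior.csv`:

```python
import pandas as pd

# Toy product catalogue and transactional table (order_products__prior.csv
# has one row per product per order in the real dataset).
products = pd.DataFrame({"product_id": [1, 2, 3, 4]})
prior = pd.DataFrame({"order_id": [10, 10, 11], "product_id": [1, 2, 1]})

# Products that appear in the catalogue but never in any transaction.
never_ordered = products[~products["product_id"].isin(prior["product_id"])]
print(never_ordered["product_id"].tolist())  # [3, 4]
```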

The following questions will uncover the information stored in this dataset. Stakeholders and marketing teams can use this information in ways that suit their needs.

  1. Each product's Frequency / Count.
  2. Analyzing purchase activity of customers for each day in a week.
  3. Analyzing purchase activity of customers for each hour in a day.
  4. Frequency / Count of reordered products.
  5. Identify potential products that are ordered after a gap of a few days.
  6. Comparing each Aisle's Product sales vs. Total number of products.

--> 1. Each product's Frequency / Count.

We have already completed this process and saved the results to a CSV file. All we need to do is add the product names, and our result will be ready.

Now we know that 11 products are missing from the transactional dataset, and since we performed an outer join, there must be some null values.

It seems the frequency column has null values. Only the frequency column is affected because every product id was present in the product dataset, but not every product appeared in the transactional dataset, so those products have no frequency.

The above output shows exactly the 11 products that were not present in the main transactional dataset, hence their frequency is missing. We will fill the NaN values with zero.
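The outer-join-then-fill step can be sketched as follows, on toy stand-ins for the saved frequency table and the product catalogue:

```python
import pandas as pd

# Toy product frequencies (from the transactional data) and catalogue.
freq = pd.DataFrame({"product_id": [1, 2], "frequency": [5, 3]})
products = pd.DataFrame({"product_id": [1, 2, 3],
                         "product_name": ["Milk", "Cola", "Tea"]})

# The outer join keeps catalogue products with no transactions; their
# frequency comes back as NaN, which we fill with 0.
result = products.merge(freq, on="product_id", how="outer")
result["frequency"] = result["frequency"].fillna(0).astype(int)
print(result)
```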

This information can be used in many ways. Mainly, it can be used to analyze the top 10 and worst 10 performing products.


--> 2. Analyzing purchase activity of customers for each day in a week.

Answering this question will show us the most active days within a week.

Given the above graph, we can discern that 0 and 1 are the weekend: 0 is Saturday and 1 is Sunday. Both have high and almost identical activity levels, while days 2 to 6 share roughly the same, lower level of activity.

Having high activity level on Saturday and Sunday is expected and nothing out of the ordinary can be seen in the above graph.
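The per-day activity counts behind this graph are a `value_counts` over the day-of-week column. A sketch on toy data (`order_dow` is the assumed column name from orders.csv, coded 0-6):

```python
import pandas as pd

# Toy orders table; order_dow codes 0-6 (0 = Saturday in this dataset).
orders = pd.DataFrame({"order_id": range(6),
                       "order_dow": [0, 0, 1, 2, 2, 2]})

# Number of orders placed on each day of the week.
orders_by_day = orders["order_dow"].value_counts().sort_index()
print(orders_by_day)
```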


--> 3. Analyzing purchase activity of customers for each hour in a day.

For Day 0

For Day 1

For Day 2

For Day 3

For Day 4

For Day 5

For Day 6

As expected, Day 0 and Day 1 had the highest overall activity across the 24-hour span. The output also shows that Day 0 and Day 1 have 46.47% more consumer traffic at their peak hours compared to Days 2 to 6.
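The per-day, per-hour breakdown is a two-level groupby pivoted into a day-by-hour grid. A sketch on toy data (`order_dow` and `order_hour_of_day` are the assumed column names from orders.csv):

```python
import pandas as pd

# Toy orders with day-of-week and hour-of-day columns.
orders = pd.DataFrame({
    "order_dow":         [0, 0, 0, 1, 1, 2],
    "order_hour_of_day": [10, 10, 14, 10, 11, 10],
})

# Order counts for each (day, hour) pair: one row per day, one column per hour.
activity = (orders.groupby(["order_dow", "order_hour_of_day"])
                  .size()
                  .unstack(fill_value=0))
print(activity)
# Peak hour for each day:
print(activity.idxmax(axis=1))
```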


--> 4. Frequency / Count of reordered products.

There seem to be a lot of products missing. This may be because these products were filtered out, as they were never reordered. Let's confirm this speculation.

Let's get the product names for the corresponding product ids and save the dataframe as a CSV file.
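The whole step can be sketched as a filter on `reordered == 1`, a per-product count, and a merge with the catalogue. Toy stand-ins are used; `reordered` is the assumed flag column from order_products__prior.csv:

```python
import pandas as pd

# Toy transactional table; reordered is 1 when the user bought the
# product before.
prior = pd.DataFrame({"product_id": [1, 1, 2, 3],
                      "reordered":  [1, 1, 1, 0]})
products = pd.DataFrame({"product_id": [1, 2, 3],
                         "product_name": ["Milk", "Cola", "Tea"]})

# Count reorders per product -- products never reordered drop out here,
# which explains the "missing" products noted above.
reorder_freq = (prior[prior["reordered"] == 1]
                .groupby("product_id").size()
                .rename("reorder_count").reset_index()
                .merge(products, on="product_id"))
print(reorder_freq)
# reorder_freq.to_csv("reordered_products.csv", index=False)
```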

This information shows which products are most liked by consumers and are likely to increase in sales. It can also help stakeholders compare similar products from different companies.


--> 5. Identify potential products that are ordered after a gap of a few days

There are two peaks in the graph: the first at 7 and the second at 30. This means that most of the time, people tend to place orders after 7 or 30 days.


We will now identify products that are ordered after a gap of 7 days.

First, we got all the order ids that were placed after a 7-day gap.


Next, we created a new dataframe containing the transactional data for only the extracted order ids.


Finally, we created the final dataframe of product frequencies, added product names, aisle ids and department ids, and saved it to a CSV file.
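The three steps above can be sketched as follows, on toy stand-ins for orders.csv and the transactional table (`days_since_prior_order` is the assumed column name):

```python
import pandas as pd

# Toy orders and transactional tables.
orders = pd.DataFrame({"order_id": [1, 2, 3],
                       "days_since_prior_order": [7.0, 30.0, 7.0]})
prior = pd.DataFrame({"order_id": [1, 1, 2, 3],
                      "product_id": [10, 11, 10, 10]})

# Step 1: order ids placed exactly 7 days after the previous order.
gap7_orders = orders.loc[orders["days_since_prior_order"] == 7, "order_id"]
# Step 2: restrict the transactional data to those orders.
gap7 = prior[prior["order_id"].isin(gap7_orders)]
# Step 3: product frequency within the 7-day-gap orders.
gap7_freq = gap7["product_id"].value_counts()
print(gap7_freq)
```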

This insight can convey two important pieces of information.

  1. This insight shows us which products sell at a faster rate, meaning the profit on these products is realized sooner than on others.

  2. Stakeholders can also use this insight to utilize warehouse storage more effectively.


--> 6. Comparing each Aisle's Product sales vs. Total number of products.

Here, we will try to understand whether this company (i.e. Instacart) is investing in "profitable" products or not. Instead of pinpointing each product's sales, we are going to check the sales for each aisle. Doing this will help us see the bigger picture.

For example: let's say there are three aisles in total, A, B and C. The variety of products offered by each aisle (total number of products) is 3, 7 and 4 respectively. The product sales data for each aisle looks like this:

| aisle | total_products_offered | products_sold |
|-------|------------------------|---------------|
| A     | 3                      | 30            |
| B     | 7                      | 12            |
| C     | 4                      | 25            |

We can clearly see that Aisle B has the lowest product sales yet the highest total number of products. Hence, we could reduce the investment in Aisle B and redirect it to Aisles A and C, which are clearly performing much better.
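The toy example above can be made explicit by computing sales per product offered for each aisle:

```python
import pandas as pd

# The toy example from the text: sales vs. variety per aisle.
aisle_stats = pd.DataFrame({
    "aisle": ["A", "B", "C"],
    "total_products_offered": [3, 7, 4],
    "products_sold": [30, 12, 25],
})

# Sales per product offered makes the underperformance of B explicit.
aisle_stats["sold_per_product"] = (aisle_stats["products_sold"]
                                   / aisle_stats["total_products_offered"])
print(aisle_stats)
```

Aisle B's roughly 1.7 sales per product, against 10.0 for A and 6.25 for C, is what justifies redirecting the investment.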

Creating a multiset bar chart for two categories:

  1. Top 10 aisles with the highest number of products sold
  2. Top 10 aisles with the highest number of unique products offered

The two graphs above clearly show that, apart from the aisles "yogurt", "chips pretzels" and "packaged cheese", the aisles that sold the most products offer less variety than the aisles with a large variety of products, which in turn contribute little to total sales.

We will now analyze further to get the clear impact of both Top 10 categories.

We can clearly see in the two donut charts above that the 10 aisles making up 46% of total sales in the dataset offer only 11.9% of the product variety.

On the other hand, the 10 aisles that make up 15.2% of total sales offer around 20.5% of the product variety.

It would be beneficial to bring more product variety to the 10 aisles that make up around 46% of total sales. Doing so will help the business grow and make investments more effective.